The introduction of RNNs to natural language processing tasks was arguably the most significant breakthrough in the domain until 2017, when Google researchers introduced a novel architecture in the paper by Vaswani et al., “Attention Is All You Need.” This architecture, termed the transformer, revolutionized research in natural language processing. It was designed for sequence-to-sequence tasks and handles long-range dependencies efficiently.
Since then, transformer-based models have consistently outperformed existing models on several benchmark NLP tasks. They handle long-range dependencies efficiently and robustly. Furthermore, because they are not sequential (unlike RNNs), they can be parallelized and trained faster on much more training data. As a result, transformers have become ubiquitous in the natural language processing community and are used extensively by researchers in academia and industry.
The transformer is one of the first transduction models that relies entirely on self-attention to convert input sequences to output sequences. Self-attention is a mechanism for establishing relationships between different elements of a sequence. In other words, it allows the model to look at the other words in a sequence to better understand a given word. It is based on the premise that each word in a sentence carries information relevant to the others, and at each step the model uses the previously generated symbols when generating the next. For example, suppose we want to translate the sentence:
" The animal didn’t cross the street because it was too tired. "
What does ‘it’ refer to here, the street or the animal? This may be a simple question for human beings, but it is an arduous task for algorithms. With the self-attention mechanism, however, ‘it’ gets associated with ‘animal.’ This happens because, when the model processes a word, it can look at the other words in the sequence for cues about the context.
Mathematically, self-attention is computed using query, key, and value vectors: it maps a query and a set of key-value pairs to an output vector. In transformers, the output is computed using ‘scaled dot-product attention,’ where the queries and keys have dimension d_k and the values have dimension d_v:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
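To make the formula concrete, here is a minimal sketch of scaled dot-product attention in NumPy. The shapes, variable names, and random inputs are illustrative assumptions, not code from the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query with each key
    weights = softmax(scores, axis=-1)   # attention distribution over positions
    return weights @ V, weights          # weighted sum of values, plus the weights

# Example: 4 tokens, d_k = d_v = 8 (arbitrary values for illustration)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)  # (4, 8) (4, 4)
```

The returned attention weights are what allow us to inspect which tokens a given position attends to, as in the ‘it’/‘animal’ example above.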
Transformers use multi-head attention, in which scaled dot-product attention is computed several times in parallel. This allows the model to jointly attend to information from different representation subspaces at different positions.
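The following sketch shows how multiple heads can be built on top of scaled dot-product attention. The model dimension, number of heads, and randomly initialized projection matrices are assumptions chosen only for illustration.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention, as in the formula above.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project the inputs, then split into heads of shape (num_heads, seq_len, d_head).
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads = [attention(Qh[h], Kh[h], Vh[h]) for h in range(num_heads)]
    concat = np.concatenate(heads, axis=-1)  # (seq_len, d_model)
    return concat @ W_o                      # final output projection

rng = np.random.default_rng(1)
d_model, num_heads, seq_len = 16, 4, 5
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (5, 16)
```

Because each head operates on its own slice of the model dimension, the heads can attend to different positions and different kinds of relationships at the same time.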
The transformer has an encoder-decoder architecture. The encoder maps an input sequence to a sequence of continuous representations, which the decoder then converts into the output sequence. Since the model uses neither recurrence nor convolution, it injects information about the absolute positions of tokens into the input representations in the form of positional encodings. For more technical details on the transformer architecture, one can refer to the original paper, “Attention Is All You Need.”
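As an illustration, here is a small sketch of the sinusoidal positional encodings described in the original paper; the resulting matrix is added to the token embeddings before the first layer. The specific max_len and d_model values are arbitrary.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of fixed sinusoidal encodings."""
    positions = np.arange(max_len)[:, None]   # (max_len, 1)
    dims = np.arange(d_model)[None, :]        # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Because these encodings are fixed functions of position rather than learned parameters, the model can, in principle, extrapolate to sequence positions it has not seen during training.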
However, there are certain limitations to the transformer model. The attention mechanism in transformers can only handle fixed-length text strings. As a result, the text first has to be split into several segments, which fragments the context, especially in the case of long sentences and paragraphs. This results in a significant loss of context and leads to degraded performance.